High-performance log-based processing

ABSTRACT

Each of a plurality of Worker processes is allowed to perform any and all of the following tasks involving logged work items: (1) reading a subset of the work items from a log; (2) sequentially ordering work items for corresponding data objects; (3) applying a sequentially ordered set of work items to a corresponding data object; and (4) transmitting a subset of work items to a Worker process running on another database server in a cluster, if necessary. These tasks can be performed concurrently, at will, and as available, by the Worker processes. An improved checkpointing technique eliminates the need for the Worker processes to get to a synchronization point and stop. Instead, a Coordinator process examines the current state of progress of the Worker processes and computes a past point in the sequence of work items at which all work items before that point have been completely processed, and records this point as the checkpoint.

FIELD OF THE INVENTION

The present invention relates generally to log-based processing and, more specifically, to techniques for parallel processing of logs of work items representing ordered operations on data objects.

BACKGROUND OF THE INVENTION

With log-based processing, work needs to be performed based on a description of the work in a set of records that are stored in a log. An example of log-based processing is system recovery processing. In log-based recovery, the log records represent a sequence of work items that are ordered operations on a set of objects. Specifically, the log records may be redo records that represent changes made to data items in a database prior to a system failure. Generally, recovering the system based on the log entails repeating the processing of the logged work items on the objects.

One context in which log-based processing may be performed is for recovery of a database system after a failure or inadvertent termination within the system. In the context of database recovery, the log is a redo log that records changes made during transactions on a set of objects. Some of the changes recorded in the redo log have been committed but not yet flushed to disk at the time of the failure. The set of objects are database objects, such as tables, rows, views, indexes, and the like. Thus, recovering the database system based on the redo log entails reapplying, to the database objects, changes reflected in the work items. Another context for log-based processing is recovery after media loss or persistent (disk) data corruption. This type of recovery typically involves restoring a backup of the data and then applying the log to replay all the changes since the time at which the backup was taken.

Use of redo logs for system recovery is described in U.S. Pat. No. 5,832,516 to Bamford et al., entitled “Caching data in recoverable objects”; U.S. Pat. No. 6,507,853 to Bamford et al., entitled “Recovering data from a failed cache using recovery logs of caches that updated the data”; and U.S. Pat. No. 6,609,136 to Bamford et al., entitled “Recovering data from a failed cache using a surviving cache”; the contents of all of which are incorporated by reference in their entirety for all purposes as if fully set forth herein.

Log-based processing is not always in the context of system recovery. Rather, log-based processing may also be performed to repeat logged work on another system. For example, log-based processing may be performed to construct and maintain a standby database system. Approaches to constructing standby databases and processing redo records are described in U.S. patent application Ser. No. 10/308,851 filed on Dec. 2, 2002 by Subramaniam, entitled “Replicating DDL Changes Using Streams”; U.S. patent application Ser. No. 10/308,879 filed on Dec. 2, 2002 by Arora et al., entitled “In Memory Streaming With Disk Backup and Recovery of Messages Captured From a Database Redo Stream”; U.S. patent application Ser. No. 10/308,924 filed on Dec. 2, 2002 by Souder et al., entitled “Asynchronous Information Sharing System”; U.S. patent application Ser. No. 10/443,206 filed on May 21, 2003 by Jain et al., entitled “Buffered Message Queue Architecture for Database Management Systems”; U.S. patent application Ser. No. 10/449,873 filed on May 30, 2003 by Lu et al., entitled “Utilizing Rules in a Distributed Information Sharing System”; the contents of all of which are incorporated by this reference in their entirety for all purposes as if fully set forth herein.

Typical approaches to log-based processing fall into two main categories. The first category involves serial schemes. With serial schemes, a single recovery process reads through the sequence of work items in the log and performs the work on the objects, one work item at a time. In large-scale systems with abundant resources, such a scheme does not take advantage of the available resources and leads to under-utilization of the system resources. For example, when there are multiple CPUs in the system, the recovery process runs in only one of the CPUs and the other CPUs are not utilized. Furthermore, serial schemes are not able to effectively overlap the CPU and I/O components of recovery processing.

The second category of log-based processing involves parallel schemes. With parallel schemes, multiple processes work together in parallel to perform log-based recovery. However, such schemes typically allocate specific tasks to named processes, thus limiting the flexibility of the entire architecture. In particular, a single process acts as the Coordinator for the log processing session. The Coordinator is assigned the task of reading through the entire sequence of work items and assigning the work to be performed to other processes known as Worker processes. Because there are no ordering constraints with respect to work processing that need to be honored across any two different objects, the entire work represented in the log is partitioned by the Coordinator, based on the objects on which the work needs to be performed, prior to assigning partitions of work to the Worker processes.

In situations in which the number of objects is much larger than the number of Worker processes (typically the case in many systems), each Worker process can be assigned a subset of the objects to work on. The Coordinator process directs all the work corresponding to an object to the Worker process that handles the subset of objects in which this object belongs. The Worker process can then process work on its objects in the order in which it receives work items from the Coordinator, thus honoring a total-ordering constraint for work processing on any given object. However, even though the work processing is handled by a set of processes in parallel, there is significant under-utilization of system resources. For example, the Coordinator process often becomes the bottleneck as it struggles to identify and extract work from the log and to assign the work to a large number of relatively idle Worker processes. Furthermore, parallel schemes typically utilize specialized Worker processes that either perform only CPU-based operations or only I/O operations.

The Coordinator process is responsible for synchronization tasks, including the need to periodically “checkpoint” the work being performed. During log-based processing, the processing of work items needs to be periodically checkpointed in order to minimize lost work upon resumption of processing after a failure of the original processing session. Processing is checkpointed by identifying and storing a common point, in the processing of the log, which all processes have reached. With such synchronization checkpoints, the Coordinator process identifies a common point in the set of work items for the various objects, and ensures that all Worker processes complete work up to that point. That is, all log processing is completed for all the work items up to that point in the set of work items, and no work is performed on any work items beyond that point in the set of work items.

Once all the processes reach the checkpoint, the Coordinator process takes appropriate action, such as saving the state of the objects, and resumes the processing of the work items via the Worker processes. This approach to handling points of synchronization leads to significant resource under-utilization because the Worker processes that are finished with their work ahead of other Worker processes, i.e., the processes that reach the juncture before the other processes, cannot continue processing more work items until every process has reached the point of synchronization.

One approach to using checkpoints in managing shared resources is described in U.S. Pat. No. 6,567,827 to Bamford et al., entitled “Using a checkpoint to manage data that is shared by a plurality of nodes”; the contents of which are incorporated by reference in their entirety for all purposes as if fully set forth herein.

Parallel schemes for log-based recovery are unable to fully utilize global system resources (particularly in configurations involving distributed clusters of CPUs and memory units) because critical-path coordination work remains centralized in a single Coordinator process and, consequently, in a single node of the distributed cluster.

Based on the foregoing, there is room for improvement in the performance characteristics of log-based processing.

The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.

BRIEF DESCRIPTION OF THE DRAWINGS

Embodiments of the present invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which like reference numerals refer to similar elements and in which:

FIG. 1 is a block diagram that illustrates an operating environment in which an embodiment of the invention may be implemented;

FIG. 2 is a flow diagram that illustrates a method for processing a sequence of work items from a log, according to an embodiment of the invention;

FIG. 3A is a block diagram that illustrates a system performing a method for processing sequences of work items from logs, according to an embodiment of the invention;

FIG. 3B is a block diagram that illustrates a system performing a method for processing sequences of work items from logs, according to an embodiment of the invention; and

FIG. 4 is a block diagram that illustrates a computer system upon which an embodiment of the invention may be implemented.

DETAILED DESCRIPTION OF EMBODIMENTS OF THE INVENTION

In the following description, for the purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of embodiments of the invention. It will be apparent, however, that embodiments of the invention may be practiced without these specific details. In other instances, well-known structures and devices are shown in block diagram form in order to avoid unnecessarily obscuring embodiments of the invention.

Functional Overview of Embodiments

Embodiments of the invention provide enhanced performance with log-based processing, by allowing each of a plurality of Worker processes to perform any and all of the following tasks involving logged work items that are each associated with a particular data object or data block: (1) reading a subset of the work items from a log; (2) sequentially ordering work items for corresponding data objects; (3) applying a sequentially ordered set of work items to a corresponding data object; and (4) in some scenarios, such as with database clusters, transmitting a subset of the work items to a Worker process running on another clustered database server instance. These tasks can be performed concurrently, as available and at will, by the Worker processes.

In general, there is much less synchronization and coordination required of the Coordinator process and much less idle time for the Worker processes, than with other approaches. Consequently, the Coordinator process workload is significantly smaller, compared with previous approaches involving Coordinator processes that read the work items from the log, order the work items for corresponding data objects and send these sequences of work items to the Worker processes that actually apply changes to the data objects. Therefore, the Coordinator process ceases to be a bottleneck in parallel processing frameworks, leading to better degrees of scalability. In addition, the Worker processes are free to move from task to task at will, which results in significantly better utilization of resources and improved performance for log-based processing.

An improved checkpointing technique further reduces the burdens of synchronization in parallel work processing by eliminating the need for the Coordinator process to (1) identify a future point of synchronization in the sequence of work items and (2) require the Worker processes to get to that point and stop, as with other approaches. Instead, in one embodiment, the Coordinator process examines the current state of progress of the Worker processes and computes a past point in the sequence of work items at which all work items before that point have been completely processed, and records this point as the checkpoint. Hence, the Coordinator process does not require any Worker process to stop working and wait until all other Worker processes reach a predetermined point of synchronization. Again, a higher degree of resource utilization is achieved as Worker processes continue to perform without stopping for checkpoint synchronization.
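By way of non-limiting illustration, the checkpoint computation described above may be sketched as follows. The sketch assumes, purely for illustration, that work items carry integer sequence numbers, that every work item read so far has been handed to some Worker process, and that each Worker process reports the lowest sequence number it has not yet finished applying. The names WorkerProgress, lowest_pending, highest_read and compute_checkpoint are hypothetical and do not appear in the description.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class WorkerProgress:
    """Progress snapshot reported by one Worker process (hypothetical structure)."""
    worker_id: str
    # Lowest sequence number this Worker is responsible for and has not yet
    # fully applied; None means the Worker currently has no unfinished work.
    lowest_pending: Optional[int]


def compute_checkpoint(snapshots: Iterable[WorkerProgress], highest_read: int) -> int:
    """Return a past point such that every work item at or below it is done.

    No Worker is asked to stop: the Coordinator only inspects the snapshots.
    Assumes every item up to highest_read has been handed to some Worker.
    """
    pending = [s.lowest_pending for s in snapshots if s.lowest_pending is not None]
    if not pending:
        # Nothing is in flight, so everything read so far has been applied.
        return highest_read
    # The slowest Worker bounds the checkpoint; everything before its
    # lowest unfinished item is complete on every Worker.
    return min(pending) - 1
```

For example, with snapshots reporting lowest pending sequence numbers of 120 and 95 and one idle Worker process, the sketch records 94 as the checkpoint while all Worker processes continue working.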

Operating Environment

FIG. 1 is a block diagram that illustrates an operating environment in which an embodiment of the invention may be implemented. FIG. 1 depicts a multi-node database system 100 that includes multiple database servers 102a-102n (i.e., instances of a database server, which are at times referred to as “server instances,” or “clustered server instances” when configured in a database cluster) that are communicatively interconnected to one another via a network 105. Each of these database servers 102a-102n is communicatively coupled to a database 104. In the process of managing data in database 104, each of these database servers 102a-102n generates a respective log 103a-103n, such as a redo log.

Embodiments of the invention are not limited to use in a multi-node system as illustrated in FIG. 1. Rather, the techniques described herein are applicable to single node systems as well. In addition, embodiments of the invention are not limited to use in a database system 100 as illustrated in FIG. 1. Rather, the techniques described herein are applicable to other systems in which log-based processing is performed, such as operating systems for computer systems (e.g., computer system 400 of FIG. 4) and data storage systems (e.g., systems that manage storage disks or volumes, storage area networks, and the like).

The example operating environment 100 includes database servers 102a-102n and a database 104. Each database server (“server”) 102a-102n comprises a combination of integrated software components and an allocation of computational resources (such as memory and processes) for executing the integrated software components on one or more processors, where the combination of the software and computational resources is used to manage a particular database on behalf of clients of the server. Among other functions of database management, a database server governs and facilitates access to a particular database, such as database 104, by processing requests by clients to access the database. Each database server 102a-102n operates to parse, interpret and manage execution of database statements, e.g., SQL queries, on database 104.

When configured together in a clustered database, each database server 102a-102n (which may be referred to as a “clustered database instance”) is communicatively interconnected via network 105 to the other servers in the cluster, to operate on shared resources persistently stored in database 104. Each shared resource is typically mastered by one of the servers 102a-102n. The master of a resource has access to the data structures associated with the resource, including distributed lock management information for the resource, and manages access to the resource by other servers.

Database 104 is communicatively coupled to servers 102a-102n via network 105 and is a repository for storing data and metadata on a persistent memory mechanism, such as a set of hard disks. Such data and metadata may be stored in database 104 logically, for example, according to relational schema, multidimensional schema, or a combination of relational and multidimensional schema.

During a client session with database 104, through any one of database servers 102a-102n, transactions can be performed on resources from database 104. As part of the management of the resources, each server 102a-102n maintains a log 103a-103n to track the evolution of the resources by recording information that describes the changes made to the resources via the transactions. For example, redo logs and undo logs are maintained by the servers 102a-102n to be used when transactions need to be reconstructed or undone, such as when one or more servers fail or when a standby database is being constructed or maintained. Redo logs are typically used to track changes made to resources, which are committed by a database server but not yet persistently stored in the database 104. At some point, logs 103a-103n are stored persistently in database 104. In a shared-disk system, servers 102a-102n have access to the logs stored in persistent memory, for use in performing log-based processing as described herein.

Database Redo Log Processing

The techniques described herein are described in reference to processing redo logs by one or more database servers 102a-102n. For non-limiting examples, redo logs may be processed (1) as part of a recovery operation in response to a failure of one or more of the servers 102a-102n, (2) in the context of constructing and/or maintaining a standby database that mirrors database 104, and (3) as part of a recovery operation in response to media loss or corruption.

In general, processing a partially ordered log of work items involves at least the following three operations: (1) reading the log entries; (2) for each data object, ordering the log entries in a sequence in which the work items were initially performed on the data object; and (3) applying the work items to the data object to bring the data object to a state that reflects the changes recorded in the redo log. In one implementation, a data object with which logged work items are associated is at the level of a data block. Each data block has a unique ID, which identifies the file number and block number of the data block.
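By way of non-limiting illustration, the three operations listed above may be sketched in a single process as follows. The sketch assumes that each work item carries a sequence value (here named scn, anticipating the system change numbers discussed later) and a block ID given as a (file number, block number) pair; the names WorkItem, process_log and the representation of a block as a list of applied changes are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple

BlockId = Tuple[int, int]  # (file number, block number)


class WorkItem(NamedTuple):
    scn: int           # assumed ordering value for the work item
    block_id: BlockId  # data block the work item applies to
    change: str        # placeholder for the logged change


def process_log(log: Iterable[WorkItem],
                blocks: Dict[BlockId, List[str]]) -> Dict[BlockId, List[str]]:
    # (1) Read the log entries.
    items = list(log)

    # (2) For each data block, order its entries in the sequence in which
    #     the work items were originally performed.
    per_block: Dict[BlockId, List[WorkItem]] = defaultdict(list)
    for item in items:
        per_block[item.block_id].append(item)
    for sequence in per_block.values():
        sequence.sort(key=lambda w: w.scn)

    # (3) Apply the ordered work items to each data block.
    for block_id, sequence in per_block.items():
        state = blocks.setdefault(block_id, [])
        for item in sequence:
            state.append(item.change)
    return blocks
```

No ordering is imposed between different blocks in this sketch, which reflects the absence of cross-object ordering constraints noted in the description.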

Embodiments of the invention involve a parallel processing scheme in which a high degree of resource utilization is obtained. Using the described techniques, much of the work that is performed in prior approaches by a Coordinator process is distributed to Worker processes that are executing in parallel. In particular, the Coordinator is relieved of the task of reading the sequence of work items, identifying and collecting the relevant work items for each Worker process and sending the collected streams of work items to the corresponding Worker processes. Instead, each of a plurality of Worker processes performs any or all of the tasks involved with processing logged work items that each correspond to a particular data object or data block. Such tasks may include, for example, (1) reading a subset of the work items from a log; (2) sequentially ordering work items for corresponding data objects; (3) applying a sequentially ordered set of work items to a corresponding data object; and (4) in some scenarios, such as with database clusters, transmitting a subset of the work items to a Worker process running on another clustered database server instance.

However, in certain system configurations, some of the Worker processes may not perform some of the tasks. For example, in certain cluster configurations and with certain hardware settings, it may not be optimal for certain Worker processes to read and/or order logs. Hence, these Worker processes may not read logs and/or order them; rather, these processes just apply changes to subsets of data blocks.

Each of these tasks can be performed at will by the Worker processes, when the overall operation is at a suitable point. For example, work items need to undergo the first task of processing before those work items can undergo the second, third or fourth tasks of processing. However, at any point in time, if a given Worker process is unable to perform one of the four tasks, then the Worker process can use its resources to perform another of the four tasks. Hence, symmetry in the work done by each Worker process is the key to ensuring that no one process will significantly delay any other process from performing some work. The presence of a single point of bottleneck, i.e., the Coordinator process, is effectively eliminated. There is no need to wait for a Coordinator process to read the entire log and/or for the Coordinator to pre-partition the log to facilitate farming out portions of the log to the various Worker processes. In general, there is much less idle time for the Worker processes than with past approaches.

FIG. 2 is a flow diagram that illustrates a method for processing a sequence of work items from a log, according to an embodiment of the invention, where each work item corresponds to a particular data object. Each of blocks 202-206 is performed by each of a plurality of Worker processes. Furthermore, blocks 202-206 can be performed by more than one Worker process at a point in time, and one Worker process can perform one of blocks 202-206 while another Worker process is performing a different one of blocks 202-206. Still further, not all of the Worker processes that are participating in processing the log(s) necessarily perform each of the tasks of blocks 202-206. Furthermore, while the flow diagram of FIG. 2 may visually imply that the processing of the four tasks is done serially, i.e., first read at block 202, then order at block 204, then possibly send at block 208, and then apply at block 210, this is not the manner in which the processing is necessarily performed by any given Worker process. Rather, each of the Worker processes can switch from any task to any task at any time that a task is ready to be performed.
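By way of non-limiting illustration, the ability of a Worker process to switch among tasks may be sketched as follows. The sketch is a deterministic, single-threaded simulation that models only the scheduling behavior: chunks of the log are opaque labels that move through read, order and apply stages, and the send task is omitted. The names Pipeline, worker_step and run, and the policy of preferring the most downstream stage that has work available, are assumptions made for the example.

```python
from collections import deque
from typing import List, Tuple


class Pipeline:
    """Shared state visible to every Worker (hypothetical structure)."""

    def __init__(self, log_chunks: List[str]) -> None:
        self.unread = deque(log_chunks)  # chunks still waiting in the log
        self.unordered = deque()         # chunks read but not yet ordered
        self.ready = deque()             # chunks ordered, waiting to be applied
        self.applied: List[str] = []     # chunks fully processed


def worker_step(p: Pipeline, worker_id: str, trace: List[Tuple[str, str]]) -> bool:
    """Perform one unit of whichever task has work; return False if idle."""
    if p.ready:
        p.applied.append(p.ready.popleft())
        trace.append((worker_id, "apply"))
    elif p.unordered:
        p.ready.append(p.unordered.popleft())
        trace.append((worker_id, "order"))
    elif p.unread:
        p.unordered.append(p.unread.popleft())
        trace.append((worker_id, "read"))
    else:
        return False
    return True


def run(p: Pipeline, worker_ids: List[str]) -> List[Tuple[str, str]]:
    """Let the Workers take turns until none of them can find any work."""
    trace: List[Tuple[str, str]] = []
    progressed = True
    while progressed:
        progressed = False
        for worker_id in worker_ids:
            if worker_step(p, worker_id, trace):
                progressed = True
    return trace
```

In a run with two Workers and three chunks, each Worker ends up performing all three kinds of task, since neither is bound to a single stage.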

FIG. 3A is a block diagram that illustrates a system performing a method for processing sequences of work items from logs, according to an embodiment of the invention. FIG. 3A is referenced to assist in describing the method illustrated in FIG. 2.

With reference to FIG. 3A, a database cluster includes three servers that each generate a log for data transactions that are executed by each respective server. Depicting three servers is arbitrary, for purposes of explanation, and does not limit embodiments to use with three servers only. Assume that server 2 (304) fails before persistently storing all of the resources, e.g., data objects, on which server 2 has committed changes. Further assume that server 1 (302) and server 3 (306) are performing a recovery operation based on the log from server 2, to change those resources to reflect those committed changes made by server 2 that have not yet been persistently reflected for the associated resources. Depicting the failure of server 2 is arbitrary, for purposes of explanation, and does not limit embodiments to use with a single server failure only. Rather, the techniques described herein are applicable to a multi-server failure, as well as applicable to a standby database construction process in which multiple logs are processed, e.g., logs from each of the servers in the system.

Partial Ordering of Log Files

A log may consist of sequences of work items corresponding to data objects from multiple threads of execution in a given server (e.g., multiple sessions with the server) and, therefore, the work items are not necessarily in sequential order for any data object. For example, one thread may record in the log a work item related to a first object, while another thread next records in the log a work item related to a different second object, while yet another thread records in the log a different work item related to the same first object. Similarly, in the scenario in which logs from multiple servers are processed, the sequences of work items across the logs are not in sequential order for the data objects because threads from each of the multiple servers may record work items, in their respective logs, that relate to the same object.

The notion of partial order of work items on data objects has two aspects. First, in the case of a single physical log file, the sequence of work items is partially ordered with respect to the set of data objects referred to by the work items. The “partial” concept refers to the idea that a work item does not have to strictly follow its predecessor work item and does not have to strictly precede its follower work item. So, there is no total order of work items in the log. However, there are certain order constraints, such as that a work item on a particular data object must follow another work item on the same data object that appears earlier in the log. In other words, application of the log onto any particular data object does define a total order.

The second aspect of “partial” order is relevant to the context of having a set of multiple logs. That is, each log is a sequence of work items. Hence, with respect to a particular data object, the sequence of work items for that data object cannot be gleaned from reading just one log. Therefore, the entire set of sequences of work items for that data object must be considered. That is, a sequence from each of the logs is merged to arrive at the total order of work items for that data object.
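By way of non-limiting illustration, merging the sequences from several logs into a total order for each data object may be sketched as follows. The sketch assumes that each log is already in ascending order of a sequence value that is comparable across logs, and that work items are (sequence value, object ID, change) tuples; these assumptions and the name total_order_per_object are made only for the example.

```python
import heapq
from collections import defaultdict
from typing import Dict, List, Tuple

LogEntry = Tuple[int, str, str]  # (sequence value, object ID, change)


def total_order_per_object(logs: List[List[LogEntry]]) -> Dict[str, List[Tuple[int, str]]]:
    """Merge several logs and return, per object, its work items in order.

    Assumes each input log is sorted by sequence value, which is what
    heapq.merge requires of its inputs.
    """
    merged = heapq.merge(*logs, key=lambda entry: entry[0])
    per_object: Dict[str, List[Tuple[int, str]]] = defaultdict(list)
    for seq, object_id, change in merged:
        per_object[object_id].append((seq, change))
    return dict(per_object)
```

Neither input log alone yields the full sequence for an object that was changed through more than one server; only the merged stream does.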

Parallel Read of Logs

At block 202, a subset of work items is read from a log. For example, any or all of Worker processes W11, W12 and W13 of server 1 and any or all of Worker processes W31, W32 and W33 of server 3 read the log of server 2. The log of server 2 may be accessed, for example, from persistent storage. Each of a plurality of Worker processes from each of server 1 and server 3 is assigned to read a different portion of the log file from server 2, i.e., a different series of work items. For example, if there are a total of eight Worker processes, with four Worker processes on each of two servers (e.g., two physical nodes executing database management server instances), and the work items are manifested in the log as two series of work items (each series of work items is referred to hereafter as a “read bin”), then the read operation is partitioned so that all four Worker processes on one server are collectively responsible for reading from one read bin (e.g., read bin 1 of FIG. 3A) and the other four Worker processes on the other server are responsible for reading the other read bin (e.g., read bin 2 of FIG. 3A). On each server, the read operation is not further partitioned. Whichever Worker process has the CPU resource to read the next chunk from the read bin will read it.
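By way of non-limiting illustration, the sharing of one read bin among the Worker processes of a server may be sketched as follows. In this sketch threads stand in for Worker processes, a list stands in for the series of work items in the read bin, and a lock-protected cursor lets whichever Worker is free claim the next chunk. The names ReadBin, claim_next_chunk, chunk_size and read_with_workers are hypothetical.

```python
import threading
from typing import List, Optional


class ReadBin:
    """One series of work items shared by all Workers on a server."""

    def __init__(self, records: List[int], chunk_size: int) -> None:
        self._records = records
        self._chunk_size = chunk_size  # e.g., chosen to suit an efficient I/O size
        self._next = 0                 # index of the first unclaimed record
        self._lock = threading.Lock()

    def claim_next_chunk(self) -> Optional[List[int]]:
        """Atomically claim the next unread chunk, or None when exhausted."""
        with self._lock:
            start = self._next
            if start >= len(self._records):
                return None
            end = min(start + self._chunk_size, len(self._records))
            self._next = end
        return self._records[start:end]


def read_with_workers(read_bin: ReadBin, n_workers: int) -> List[List[int]]:
    """Run n_workers threads that each claim chunks until the bin is empty."""
    claimed: List[List[int]] = [[] for _ in range(n_workers)]

    def loop(index: int) -> None:
        while (chunk := read_bin.claim_next_chunk()) is not None:
            claimed[index].extend(chunk)

    threads = [threading.Thread(target=loop, args=(i,)) for i in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return claimed
```

Which Worker reads which chunk is not fixed in advance and may differ from run to run; the sketch guarantees only that every record is claimed exactly once.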

The number of work items (i.e., records) read by the servers during each read operation may be based on, for example, criteria related to an efficient I/O size for the relevant hardware resources rather than some other criteria imposed by the Coordinator process and requiring partitioning of the work items based on the objects on which the work items need to be applied. Because each of the Worker processes concurrently reads a different portion of the log file, a Coordinator process does not need to partition the work items based on the associated objects to which the work items apply, before the Worker processes read the work items directly from the log. Further, the Coordinator process does not need to read each record and provide them to the Worker processes.

In one embodiment, the log(s) are read multiple times. That is, a set of Worker processes that are responsible for applying changes to a set of data blocks can read the entire log(s) and process the work items on those data blocks, while the rest of the Worker processes that are responsible for the rest of the data blocks may also read the entire log(s) and process the corresponding work on the rest of the data blocks.

Global Log Buffer

In one embodiment, the Worker processes store the information read from the log read bins in a global log buffer 300, which is accessible to server 1 and server 3. This type of implementation is based on a system having cross-machine coherent shared memory, such as a clustered cache fusion enabled system. Any of the Worker processes of server 1 and server 3 can work on reading work items from the log even if some other process is already working on ordering the work items for the data objects or applying work items to data objects, because the Worker processes are free to utilize their resources without concern for synchronization with other processes imposed by a Coordinator.

In an alternative embodiment, the global log buffer is only global within a machine, or server. Hence, Worker processes within the same machine can always access the machine's global log buffer. However, Worker processes within a given machine cannot access the log buffer in another machine. Therefore, work items may need to be shipped from Worker processes on one machine to the Worker processes on remote machines that are to apply those work items.

In another alternative embodiment, in a shared-disk cluster, ordered work items are stored on, and subsequently read from, a “global log buffer” in persistent storage rather than in-memory log buffers. The “global log buffer” in persistent storage is where the ordered work items are temporarily stored so that they are retrievable by other Worker processes for application to respective server instances.

Ordering Work Items

Because log files are only partially ordered for the set of objects to which the work items apply, work needs to be performed to order the work items for each corresponding data object. At block 204 an ordering operation is performed in which the work items are sequentially ordered for each of the corresponding data objects. The ordering operation involves accessing the work items from the global log buffer 300, which contains work items that were read from the log by participating Worker processes on participating servers. Any of the Worker processes W11, W12, W13 of server 1 and W31, W32, W33 of server 3 can work on the ordering operation, even if some of the other processes are still reading from the log or applying work items to data objects. A Worker process that performs ordering operations can order work items that the Worker process itself read from the log, or that other Worker processes read from the log. A Worker process does not have to wait on other processes to complete one stage of processing before being able to work on a subsequent stage of processing. The workload is self-balancing by allowing each Worker process to work in parallel and on whatever stage of processing is currently available and on whatever stage may need help to keep the overall processing moving forward.

For each data object that corresponds to a work item from the log being processed, the work items are sequentially ordered based on, for example, a system change number (SCN) that is associated with a transaction. SCNs are values (e.g., system timestamps) that represent when work items have occurred relative to other work items. Therefore, SCNs, or similarly functioning mechanisms, can be used to sequentially order the work items based on their relative time of occurrence.

Assigning Ordered Work Items to Bins

During log-based processing, data objects are assigned to the different Worker processes, for application of the work items to the corresponding data objects. That is, applying a work item to the data object associated with the work item is performed by the Worker process to which the data object is assigned, or partitioned. There is no requirement regarding how the data objects are assigned to the applying Worker processes. It is advantageous, however, to partition the objects evenly across the applying Worker processes. This assignment of data objects to particular Worker processes is only for the apply operation involving the corresponding work items.

As part of the ordering operation of block 204, Worker processes “place” work items for a data object in sequential order in “apply bins” that correspond to the Worker process that has been assigned to apply those work items to the corresponding data objects. For example, referring to FIG. 3A, Worker processes W11, W12 and W13 from server 1 may sequentially sort work items for a set of one or more data objects that Worker processes on server 1 and server 3 are assigned to apply.

Worker processes from server 1 place sequentially ordered work items that are to be applied by Worker processes on server 1 in bins that correspond to the particular applying Worker process on server 1. Refer to apply bins for W11, W12, W13 in “apply bins populated by server 1” 308 in FIG. 3A. Similarly, Worker processes from server 1 place sequentially ordered work items that are to be applied by Worker processes of server 3 in bins that correspond to the particular applying Worker process on server 3. Refer to apply bins for W31, W32, W33 in “apply bins populated by server 1” 308 in FIG. 3A.

Likewise, Worker processes from server 3 place sequentially ordered work items that are to be applied by Worker processes on server 1 in bins that correspond to the particular applying Worker process on server 1. Refer to apply bins for W11, W12, W13 in “apply bins populated by server 3” 310 in FIG. 3A. Similarly, Worker processes from server 3 place sequentially ordered work items that are to be applied by Worker processes of server 3 in bins that correspond to the particular applying Worker process on server 3. Refer to apply bins for W31, W32, W33 in “apply bins populated by server 3” 310 in FIG. 3A.

Therefore, each apply bin holds a set of work items that correspond to a set of data objects, and which are to be applied by the particular Worker process to which the bin corresponds. Hence, the corresponding work items can be applied to these data objects without coordination or synchronization with other Worker processes. Furthermore, each apply bin contains all of the work items corresponding to the set of data objects corresponding to the apply bin, which were ordered by the Worker processes on the particular server that populates that apply bin. In one embodiment, the work items are not actually stored in the apply bins; rather, references (e.g., pointers) to work items in the global log buffer 300 are stored in or associated with the apply bins. In one embodiment, the information associated with the apply bins is in a different buffer than the global log buffer 300.
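One way to picture apply bins that hold references rather than work items is the following sketch. The assignment rule (a stable hash of the object identifier modulo the number of Worker processes) is purely an assumption for illustration, since no particular assignment is required; the names WORKERS, assigned_worker and fill_apply_bins are likewise hypothetical.

```python
import zlib
from collections import defaultdict

WORKERS = ["W11", "W12", "W13", "W31", "W32", "W33"]


def assigned_worker(object_id: str) -> str:
    # Hypothetical assignment rule: a stable hash modulo the worker count.
    return WORKERS[zlib.crc32(object_id.encode()) % len(WORKERS)]


def fill_apply_bins(log_buffer):
    """Return a mapping: applying worker -> indices into log_buffer, in SCN order.

    log_buffer is a list of (scn, object_id) pairs. The bins store indices
    (references) into the buffer instead of copies of the work items.
    """
    bins = defaultdict(list)
    in_scn_order = sorted(range(len(log_buffer)), key=lambda i: log_buffer[i][0])
    for index in in_scn_order:
        _, object_id = log_buffer[index]
        bins[assigned_worker(object_id)].append(index)
    return dict(bins)


sample_buffer = [(7, "B"), (3, "A"), (5, "B"), (9, "A"), (4, "C")]
sample_bins = fill_apply_bins(sample_buffer)
```

Because all work items for a given data object land in the bin of a single applying Worker process, and each bin is in SCN order, that process can apply its bin without coordinating with the others.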

Sending Bins to Remote Servers for Applying Work Items

At decision block 206, it is determined whether or not any of the work items that were sequentially ordered at block 204 need to be applied by a remote Worker process. For example, the work items in apply bins for W31, W32, W33 in “apply bins populated by server 1” 308 (FIG. 3A) are assigned to be applied by remote Worker processes on server 3 and, therefore, need to be sent to the remote server (server 3), at block 208. Similarly, the work items in apply bins for W11, W12, W13 in “apply bins populated by server 3” 310 (FIG. 3A) are assigned to be applied by remote Worker processes on server 1 and, therefore, need to be sent to the remote server (server 1), at block 208. A Worker process that performs sending operations can send work items that the Worker process itself read from the log or placed in order, or that one or more other Worker processes read from the log and/or ordered.

In an embodiment in which the global log buffer is global across the system servers (as depicted in FIG. 3A), such as with cross-machine coherent shared memory, references to the work items can be shipped to remote Worker processes rather than shipping the work item itself. In an embodiment in which the global log buffer is global only within a machine, the work item itself is sent to a remote Worker process because the remote Worker process does not have access to the global log buffer of a different machine.

In the scenario in which an apply bin of ordered work items is sent to a remote process, a second merge and ordering operation may be performed by the remote Worker process if the remote Worker process receives separate apply bins from different servers, in order to sequentially order the work items from the multiple apply bins for the corresponding data objects.
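The second merge performed by a receiving Worker process can be sketched as a merge of already sorted inputs. The sketch assumes each received apply bin is a list of (SCN, object identifier, change) tuples already in SCN order; merge_bins and the sample bins are hypothetical names. It uses heapq.merge from the Python standard library, which merges sorted inputs without re-sorting them.

```python
import heapq


def merge_bins(*bins):
    """Merge apply bins that are each already in SCN order into one SCN-ordered list."""
    return list(heapq.merge(*bins, key=lambda item: item[0]))


# Apply bins for the same applying worker, received from two different servers.
bin_from_server1 = [(3, "A", "insert"), (8, "A", "update")]
bin_from_server3 = [(5, "A", "update"), (9, "B", "insert")]
merged = merge_bins(bin_from_server1, bin_from_server3)
```

Since each incoming bin is already ordered, the merge is linear in the total number of work items, which is cheaper than sorting the combined set from scratch.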

Applying Work Items to Objects

At block 210, a sequentially ordered set of work items is applied to a corresponding data object. If at decision block 206 it is determined that none of the work items need to be applied by a remote Worker process, then sending apply bins to a remote process is unnecessary and execution moves directly to block 210, rather than to block 210 through block 208. Applying the work items generally involves making the changes that are represented by the work items to the corresponding data objects. For example, in the context of database recovery, the transactions that were committed on objects but were not stored to disk (i.e., the work items in a redo log) are now applied to whatever version of the data objects is appropriate, according to a conventional recovery process. As another example, in the context of standby database construction, the changes that were committed on objects in a primary database are now applied to the standby or secondary database.

Work items are applied to a given object in sequential order, but can be applied to different objects in any order. Furthermore, in some scenarios, work items can be applied to different objects without any concern for ordering when there are not multiple changes made to the same object. The Worker process can proceed with applying a change to a data object if the Worker process is sure that it has received and applied all of the prior changes to that data object. The Worker process knows this fact when it has received the changes up to a point in the log that is beyond the change being considered for application. This allows Worker processes to be flexible in their processing.

Optimization of Apply Process

In one embodiment, the work items are applied to corresponding data objects in order at the data block level. When applying work items to data objects, the data objects need to be read from persistent storage into memory that is local to the applying process (e.g., a buffer cache), where the changes represented by the work items are applied. In one embodiment, the work items that are not yet applied, and for which the corresponding data objects are not yet in local memory, are cached local to the applying process. For example, the work items can be associated with the buffer that is pending IO. Hence, while waiting for the pending IO operation to complete so that the required objects are in local memory for applying the changes, the Worker process can move on to other processing, rather than wait idly for the IO operation to complete. For example, while waiting for the pending IO operation to complete, the Worker process can work on reading, and/or ordering and/or applying work items associated with another data object. Once the IO completes, the Worker process can apply the work items to the data objects that were just provided to local memory.
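The deferral of work items behind a pending IO can be sketched as follows. This is a simplified, single-threaded illustration under stated assumptions: a data object is modeled as a list of applied changes, and the asynchronous read is represented only by recording that it was requested. The class ApplyWorker and its members are hypothetical names.

```python
from collections import defaultdict


class ApplyWorker:
    def __init__(self):
        self.buffer_cache = {}            # object_id -> in-memory object (a list of applied changes)
        self.pending = defaultdict(list)  # object_id -> ordered work items waiting on IO
        self.io_requested = []            # object_ids for which a read has been issued

    def submit(self, object_id, work_items):
        """Apply immediately if the object is in memory; otherwise cache the items."""
        if object_id in self.buffer_cache:
            self.buffer_cache[object_id].extend(work_items)
            return
        if object_id not in self.pending:
            self.io_requested.append(object_id)  # stand-in for issuing an asynchronous read
        self.pending[object_id].extend(work_items)

    def io_complete(self, object_id, loaded_object):
        """Called when the read finishes: apply the cached items in order."""
        self.buffer_cache[object_id] = loaded_object
        loaded_object.extend(self.pending.pop(object_id, []))


worker = ApplyWorker()
worker.buffer_cache["B"] = []          # object B is already in local memory
worker.submit("A", ["a1", "a2"])       # object A is not in memory: items are cached
worker.submit("B", ["b1"])             # applied at once while A's IO is pending
worker.io_complete("A", [])            # A arrives: cached items are applied in order
```

In the usage above, the change to object B is applied while the work items for object A remain cached, so the Worker process is not idle during the pending read.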

Physical Standby Configuration

FIG. 3B is a block diagram that illustrates a system performing a method for processing sequences of work items from logs, according to an embodiment of the invention. FIG. 3B illustrates a different context than that illustrated in FIG. 3A, in which log-based processing may be performed. In FIG. 3B, two standby servers, standby server 1 (312) and standby server 2 (314), are each processing a log from an associated primary server, in order to replicate the state of the associated primary server. That is, the process described herein is used to update a standby database, i.e., a copy of a primary database, with changes that are made in the primary database.

For such a system, a method for processing a sequence of work items from a log is similar to the method illustrated in FIG. 2. Because, as depicted in FIG. 3B, there is a one-to-one relationship between each standby server and its related primary server, the log for a given primary server is read by only one standby server. If there were not a one-to-one relationship between primary and standby servers, such as in a system in which there are four primary database servers and only two standby database servers, a standby server may read logs from more than one primary server. However, in both scenarios, there are work items from logs from a primary server that may need to be applied by a standby server other than the standby server that actually read the log from the primary server. Thus, the need for and the ability of Worker processes at one server to send work items over to Worker processes at another server is still present.

Checkpointing the Process

As mentioned, a checkpointing process is commonly employed to limit the amount of rework in response to a failure. If the process described above fails for any reason, for example, due to a missing log or corrupt log, repeating the process from the beginning wastes resources. The goal is to minimize repeating any processing that has already been performed.

In comparison with other approaches, a more fluid checkpointing scheme is used that reduces the burdens of synchronization in parallel work processing. In one embodiment, the Coordinator process does not identify a future point of synchronization in the sequence of work items and require the Worker processes to get to that point and stop. Instead, the Coordinator examines the current state of progress of all the participating Worker processes and computes a past point in the sequence of work items at which all work items before that point have been processed, and persistently records this point as the checkpoint. Therefore, a global state of the process is maintained inexpensively in a distributed apply model without the need for synchronization messages back and forth between the Coordinator and the Workers.

Essentially, the same effect is achieved by simply recording the progress that has already been made by Worker processes, rather than predefining a common synchronization point for all the Worker processes to reach, which often requires some processes to wait idly for all of the other processes to reach that point. Consequently, a higher degree of resource utilization is achieved as the Worker processes continue working without stopping for checkpoint synchronization.

Each Worker maintains its current state of progress locally, which is periodically collected by the Coordinator, from which a global state is computed and recorded persistently. The global state is a global low watermark, which represents a common point in the work item apply process that each applying process has reached, i.e., a checkpoint. This checkpoint can be characterized by an SCN associated with the common sequence of work items. Hence, upon a failure of the process, it is known that no applying of work items before the checkpoint needs to be repeated. Furthermore, the Coordinator tracks a high watermark for each Worker process, which represents the latest point in the sequence of work items that the Worker process has reached. Hence, upon a failure of the process, it is known from which point various Worker processes may need to be brought back to the checkpoint in order to bring all the Worker processes to a common point of applying.
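The computation of the global low watermark can be sketched as taking the minimum over the progress reported by the Worker processes. The sketch assumes that each Worker process reports an integer SCN through which it has applied all of the work items assigned to it; compute_checkpoint, the sample values and the derived lag figure are hypothetical and illustrative only.

```python
def compute_checkpoint(progress):
    """Return the global low watermark.

    progress maps each worker to the SCN through which that worker has
    applied all work items assigned to it (its high watermark).
    """
    return min(progress.values())


# High watermarks most recently collected by the Coordinator (illustrative values).
progress = {"W11": 120, "W12": 95, "W13": 140, "W31": 110}
checkpoint = compute_checkpoint(progress)

# How far each worker has advanced beyond the checkpoint.
ahead = {worker: scn - checkpoint for worker, scn in progress.items()}
```

No Worker process is asked to stop or to reach a particular point; the checkpoint is derived solely from progress that has already been made.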

The manner in which the Coordinator collects the states of the Worker processes may vary from implementation to implementation. In one embodiment, the Worker processes periodically push this information to the Coordinator. In addition, the Coordinator may periodically push the checkpoint to the Worker processes so that the Worker processes are aware of the global progress of the process.

Process Implementations

The log-based processing described herein is described primarily in the context of implementations for a database recovery process and a primary database-standby database synchronization process. However, these are not the only contexts in which the techniques may be implemented. As non-limiting examples, the techniques described herein are also applicable in the context of mirroring software and storage units. These types of systems also ship changes in logs to remote sites (although they may not be referred to explicitly as “logs”). Hence, the techniques may be used, for example, for keeping a remote filesystem synchronized with changes made at an original filesystem and keeping remote storage units synchronized with changes made in original storage units.

Hardware Overview

FIG. 4 is a block diagram that illustrates a computer system 400 upon which an embodiment of the invention may be implemented. Computer system 400 includes a bus 402 or other communication mechanism for communicating information, and a processor 404 coupled with bus 402 for processing information. Computer system 400 also includes a main memory 406, such as a random access memory (RAM) or other dynamic storage device, coupled to bus 402 for storing information and instructions to be executed by processor 404. Main memory 406 also may be used for storing temporary variables or other intermediate information during execution of instructions to be executed by processor 404. Computer system 400 further includes a read only memory (ROM) 408 or other static storage device coupled to bus 402 for storing static information and instructions for processor 404. A storage device 410, such as a magnetic disk, optical disk, or magneto-optical disk, is provided and coupled to bus 402 for storing information and instructions.

Computer system 400 may be coupled via bus 402 to a display 412, such as a cathode ray tube (CRT) or a liquid crystal display (LCD), for displaying information to a computer user. An input device 414, including alphanumeric and other keys, is coupled to bus 402 for communicating information and command selections to processor 404. Another type of user input device is cursor control 416, such as a mouse, a trackball, or cursor direction keys for communicating direction information and command selections to processor 404 and for controlling cursor movement on display 412. This input device typically has two degrees of freedom in two axes, a first axis (e.g., x) and a second axis (e.g., y), that allows the device to specify positions in a plane.

The invention is related to the use of computer system 400 for implementing the techniques described herein. According to one embodiment of the invention, those techniques are performed by computer system 400 in response to processor 404 executing one or more sequences of one or more instructions contained in main memory 406. Such instructions may be read into main memory 406 from another computer-readable medium, such as storage device 410. Execution of the sequences of instructions contained in main memory 406 causes processor 404 to perform the process steps described herein. In alternative embodiments, hard-wired circuitry may be used in place of or in combination with software instructions to implement the invention. Thus, embodiments of the invention are not limited to any specific combination of hardware circuitry and software.

The term “computer-readable medium” as used herein refers to any medium that participates in providing instructions to processor 404 for execution. Such a medium may take many forms, including but not limited to, non-volatile media, volatile media, and transmission media. Non-volatile media includes, for example, optical, magnetic, or magneto-optical disks, such as storage device 410. Volatile media includes dynamic memory, such as main memory 406. Transmission media includes coaxial cables, copper wire and fiber optics, including the wires that comprise bus 402. Transmission media can also take the form of acoustic or light waves, such as those generated during radio-wave and infra-red data communications.

Common forms of computer-readable media include, for example, a floppy disk, a flexible disk, hard disk, magnetic tape, or any other magnetic medium, a CD-ROM, any other optical medium, punchcards, papertape, any other physical medium with patterns of holes, a RAM, a PROM, and EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave as described hereinafter, or any other medium from which a computer can read.

Various forms of computer readable media may be involved in carrying one or more sequences of one or more instructions to processor 404 for execution. For example, the instructions may initially be carried on a magnetic disk of a remote computer. The remote computer can load the instructions into its dynamic memory and send the instructions over a telephone line using a modem. A modem local to computer system 400 can receive the data on the telephone line and use an infra-red transmitter to convert the data to an infra-red signal. An infra-red detector can receive the data carried in the infra-red signal and appropriate circuitry can place the data on bus 402. Bus 402 carries the data to main memory 406, from which processor 404 retrieves and executes the instructions. The instructions received by main memory 406 may optionally be stored on storage device 410 either before or after execution by processor 404.

Computer system 400 also includes a communication interface 418 coupled to bus 402. Communication interface 418 provides a two-way data communication coupling to a network link 420 that is connected to a local network 422. For example, communication interface 418 may be an integrated services digital network (ISDN) card or a modem to provide a data communication connection to a corresponding type of telephone line. As another example, communication interface 418 may be a local area network (LAN) card to provide a data communication connection to a compatible LAN. Wireless links may also be implemented. In any such implementation, communication interface 418 sends and receives electrical, electromagnetic or optical signals that carry digital data streams representing various types of information.

Network link 420 typically provides data communication through one or more networks to other data devices. For example, network link 420 may provide a connection through local network 422 to a host computer 424 or to data equipment operated by an Internet Service Provider (ISP) 426. ISP 426 in turn provides data communication services through the world wide packet data communication network now commonly referred to as the “Internet” 428. Local network 422 and Internet 428 both use electrical, electromagnetic or optical signals that carry digital data streams. The signals through the various networks and the signals on network link 420 and through communication interface 418, which carry the digital data to and from computer system 400, are exemplary forms of carrier waves transporting the information.

Computer system 400 can send messages and receive data, including program code, through the network(s), network link 420 and communication interface 418. In the Internet example, a server 430 might transmit a requested code for an application program through Internet 428, ISP 426, local network 422 and communication interface 418.

The received code may be executed by processor 404 as it is received, and/or stored in storage device 410, or other non-volatile storage for later execution. In this manner, computer system 400 may obtain application code in the form of a carrier wave.

Extensions and Alternatives

In the foregoing description, embodiments of the invention have been described with reference to numerous specific details that may vary from implementation to implementation. Thus, the sole and exclusive indicator of what is the invention, and is intended by the applicants to be the invention, is the set of claims that issue from this application, in the specific form in which such claims issue, including any subsequent correction. Any definitions expressly set forth herein for terms contained in such claims shall govern the meaning of such terms as used in the claims. Hence, no limitation, element, property, feature, advantage or attribute that is not expressly recited in a claim should limit the scope of such claim in any way. Therefore, the specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.

In addition, in this description certain process steps are set forth in a particular order, and alphabetic and alphanumeric labels may be used to identify certain steps. Unless specifically stated in the description, embodiments of the invention are not necessarily limited to any particular order of carrying out such steps. In particular, the labels are used merely for convenient identification of steps, and are not intended to specify or require a particular order of carrying out such steps.

What is claimed is:
 1. A method for processing sequences of work items from a log, wherein each work item in said log corresponds to a particular data object of a plurality of data objects, the method comprising computer-implemented steps of: each worker process of a plurality of worker processes producing a sequentially ordered set of work items of a plurality of sequentially ordered set of work items, wherein said sequentially ordered set of work items corresponds to a respective data object of said plurality of data objects, wherein producing a sequentially ordered set of work items of a plurality of sequentially ordered set of work items comprises: reading, from said log, a subset of the work items, said subset having a sequential log order; based on the respective data object for said each worker process, ordering work items in said sequentially ordered set of work items; wherein no work item in any other sequentially ordered set of work items of said plurality of sequentially ordered set of work items corresponds to respective one or more data objects of said sequentially ordered set of work items; and wherein the sequentially ordered set of work items is different than any other sequentially ordered set of work items of said plurality of sequentially ordered set of work items; wherein, each data object of said plurality of data objects corresponds to at least one work item of said plurality of sequentially ordered set of work items; and wherein the method is performed by one or more computer devices.
 2. The method of claim 1, wherein the step of reading is performed by said each worker processes of said plurality of worker processes without partitioning of the work items by a coordinator process prior to the step of reading.
 3. The method of claim 2, wherein the step of producing said sequentially ordered set of work items is performed by said each worker processes without receiving the subset of work items from the coordinator process.
 4. The method of claim 1, wherein a set of worker processes includes said plurality of worker processes, wherein said log is a global log buffer, wherein the steps further include a first worker process of said set of worker processes adding one or more sequences of work items to said global log buffer.
 5. The method of claim 4, wherein said plurality of processes includes said first worker process.
 6. The method of claim 4, wherein said plurality of processes does not include said first worker process.
 7. The method of claim 1, wherein the steps further include a set of worker processes applying said plurality of sequentially ordered work items to said plurality of data objects.
 8. The method of claim 7, wherein the second worker process belongs to plurality of worker processes.
 9. The method of claim 7, wherein said second worker process does not belong to said plurality of worker processes.
 10. The method of claim 7, wherein each of the plurality of worker processes is associated with one of a plurality of servers that are communicatively interconnected, wherein each of the set of worker processes is associated with one of said plurality of servers, wherein the first worker process is associated with a first server of said plurality of servers, wherein the second worker process is associated with a second server of said plurality of servers, the method further comprising the computer-implemented step of the first worker process sending a first sequentially ordered set of work items for a corresponding data object to the second worker process, for applying the first sequentially ordered set of work items to a corresponding data object.
 11. The method of claim 1, wherein each of the plurality of worker processes is associated with one of a plurality of servers that are communicatively interconnected, wherein the log comprises work items associated with at least two of the plurality of servers.
 12. The method of claim 7, further comprising the computer-implemented steps of: by each of the set of worker processes, periodically providing to a coordinator process, an identifier of the most recent work item, from the log, that the worker process has applied; by the coordinator process, persistently storing a global checkpoint that identifies a particular location in the sequences of work items in the log, that all of the plurality of worker processes have reached in the step of applying work items to corresponding data objects, and periodically providing, to each of the set of worker processes, the global checkpoint.
 13. The method of claim 12, further comprising the computer-implemented steps of: by the coordinator process, persistently storing the identifier of the most recent work item that each worker process of the set of worker processes has applied.
 14. The method of claim 7, wherein applying comprises applying, by a first worker process of the set of worker processes, a first sequentially ordered set of work items of said plurality of sequentially ordered set of work items to a first data object and a second sequentially ordered set of work items of said plurality of sequentially ordered set of work items to a second data object, the method further comprising: by the first worker process, caching the first sequentially ordered set of work items into a cache accessible to the first worker process; while waiting for the first data object to be loaded from persistent storage into volatile memory accessible to the first worker process, applying the second sequentially ordered set of work items to the second data object; and once the first data object is loaded into the volatile memory accessible to the first worker process, then reading the first sequentially ordered set of work items from the cache, and applying the first sequentially ordered set of work items to the first data object in the volatile memory.
 15. The method of claim 7, wherein the step of applying includes applying sequentially ordered sets of work items to corresponding data objects, one data object at a time in any order of data objects.
 16. The method of claim 1, wherein the step of reading work items from said log includes reading, by a first set of worker processes of said plurality of worker processes, from said log; and reading, by a second set of worker processes that is different than the first set of said plurality of worker processes, from said log.
 17. The method of claim 1, wherein the method for processing sequences of work items from said log is performed by a group of worker processes, wherein the plurality of worker processes are a subset of the group of worker processes, and wherein the steps of reading and producing are not performed by every worker process of the group of worker processes.
 18. The method of claim 7, wherein the steps of reading, producing and applying are performed as part of a database recovery process performed in response to a failure of one or more database management servers.
 19. The method of claim 7, wherein the steps of reading, producing and applying are performed as part of a database recovery process performed in response to corruption or loss of persistently stored data managed by one or more database management servers.
 20. The method of claim 7, wherein the steps of reading, producing and applying are performed as part of a process of updating a copy of a database with changes made at a database from which the copy was derived.
 21. The method of claim 7, wherein the steps of reading, producing and applying are performed as part of a process of updating a copy of a file system with changes made at a file system from which the copy was derived.
 22. The method of claim 7, wherein the steps of reading, producing and applying are performed as part of a process of updating a copy of data stored on one or more storage units with changes made at one or more storage units from which the copy was derived.
 23. Anon-transitory computer-readable storage medium storing one or moresequences of instructions for processing sequences of work items from alog, which sequences of instructions, when executed by one or moreprocessors, cause performance of steps comprising each worker process ofa plurality of worker processes producing a sequentially ordered set ofwork items of a plurality of sequentially ordered set of work items,wherein said sequentially ordered set of work items corresponds to arespective data object of said plurality of data objects, whereinproducing a sequentially ordered set of work items of a plurality ofsequentially ordered set of work items comprises: reading, from saidlog, a subset of the work items, said subset having a sequential logorder, wherein each work item in said subset of work items correspondsto a data object of said plurality of data objects; based on therespective one or more data object for said each worker process,ordering work items in said sequentially ordered set of work items;wherein no work item in any other sequentially ordered set of work itemsof said plurality of sequentially ordered set of work items correspondsto respective one or more data object of said sequentially ordered setof work items; and wherein the sequentially ordered set of work items isdifferent than any other sequentially ordered set of work items of saidplurality of sequentially ordered set of work items; wherein, each dataobject of said plurality of data objects corresponds to at least onework item of said plurality of sequentially ordered sets of work items;and wherein the method is performed by one or more computer devices. 24.The non-transitory computer-readable storage medium of claim 23, whereinthe step of reading is performed by said each worker processes of saidplurality of worker processes without partitioning of the work items bya coordinator process prior to the step of reading.
 25. Thenon-transitory computer-readable storage medium of claim 24, wherein thestep of producing said sequentially ordered set of work items isperformed by said each worker processes without receiving the subset ofwork items from the coordinator process.
 26. The non-transitorycomputer-readable storage medium of claim 23, wherein a set of workerprocesses includes said plurality of worker processes, wherein said logis a global log buffer, wherein the steps further include a first workerprocess of said set of worker processes adding one or more sequences ofwork items to said global log buffer.
 27. The non-transitorycomputer-readable storage medium of claim 26, wherein said plurality ofprocesses includes said first worker process.
 28. The non-transitorycomputer-readable storage medium of claim 26, wherein said plurality ofprocesses does not include said first worker process.
 29. The non-transitory computer-readable storage medium of claim 23, wherein the steps further include a set of worker processes applying said plurality of sequentially ordered work items to said plurality of data objects.
 30. The non-transitory computer-readable storage medium of claim 29, wherein the second worker process belongs to plurality of worker processes.
 31. The non-transitory computer-readable storage medium of claim 29, wherein said second worker process does not belong to said plurality of worker processes.
 32. The non-transitory computer-readable storage medium of claim 29, wherein each of the plurality of worker processes is associated with one of a plurality of servers that are communicatively interconnected, wherein each of the set of worker processes is associated with one of said plurality of servers, wherein the first worker process is associated with a first server of said plurality of servers, wherein the second worker process is associated with a second server of said plurality of servers, wherein the steps further comprise the first worker process sending a first sequentially ordered set of work items for a corresponding data object to the second worker process, for applying the first sequentially ordered set of work items to a corresponding data object.
 33. The non-transitory computer-readable storage medium of claim 29, wherein each of the plurality of worker processes is associated with one of a plurality of servers that are communicatively interconnected, wherein the log comprises work items associated with at least two of the plurality of servers.
 34. The non-transitory computer-readable storage medium of claim 29, the steps further comprising: by each of the set of worker processes, periodically providing to a coordinator process, an identifier of the most recent work item, from the log, that the worker process has applied; by the coordinator process, persistently storing a global checkpoint that identifies a particular location in the sequences of work items in the log, that all of the plurality of worker processes have reached in the step of applying work items to corresponding data objects, and periodically providing, to each of the set of worker processes, the global checkpoint.
 35. The non-transitory computer-readable storage medium of claim 34, the steps further comprising: by the coordinator process, persistently storing the identifier of the most recent work item that each worker process of the set of worker processes has applied.
 36. The non-transitory computer-readable storage medium of claim 29, wherein applying comprises applying, by a first worker process of the set of worker processes, a first sequentially ordered set of work items of said plurality of sequentially ordered set of work items to a first data object and a second sequentially ordered set of work items of said plurality of sequentially ordered set of work items to a second data object, the steps further comprising: by the first worker process, caching the first sequentially ordered set of work items into a cache accessible to the first worker process; while waiting for the first data object to be loaded from persistent storage into volatile memory accessible to the first worker process, applying the second sequentially ordered set of work items to the second data object; and once the first data object is loaded into the volatile memory accessible to the first worker process, then reading the first sequentially ordered set of work items from the cache, and applying the first sequentially ordered set of work items to the first data object in the volatile memory.
 37. The non-transitory computer-readable storage medium of claim 29, wherein the step of applying includes applying sequentially ordered sets of work items to corresponding data objects, one data object at a time in any order of data objects.
 38. The non-transitory computer-readable storage medium of claim 23, wherein the step of reading work items from said log includes reading, by a first set of worker processes of said plurality of worker processes, from said log; and reading, by a second set of worker processes that is different than the first set of said plurality of worker processes, from said log.
 39. The non-transitory computer-readable storage medium of claim 23, wherein the non-transitory computer-readable storage medium for processing sequences of work items from said log is performed by a group of worker processes, wherein the plurality of worker processes are a subset of the group of worker processes, and wherein the steps of reading and producing are not performed by every worker process of the group of worker processes.
 40. The non-transitory computer-readable storage medium of claim 23, wherein a worker process of said plurality of worker processes produces said one or more sequentially ordered sets of work items based, at least in part, on work items that were read from said log, by worker processes from said plurality of worker processes other than said worker process.
 41. The non-transitory computer-readable storage medium of claim 23 wherein the log is one of a plurality of logs from which said plurality of worker processes read said work items.
 42. The method of claim 1, wherein at least two work items correspond to a particular data object of said plurality of data objects, wherein a particular sequentially ordered set of work items of said plurality of ordered sets of work items corresponds to said particular data object and contains said at least two work items, wherein a relative order of said at least two work items in said subset of work items differs from the relative order of said at least two work items in said particular sequentially ordered set of work items.
 43. The non-transitory computer-readable storage medium of claim 23, wherein at least two work items correspond to a particular data object of said plurality of data objects, wherein a particular sequentially ordered set of work items of said plurality of ordered sets of work items corresponds to said particular data object and contains said at least two work items, wherein a relative order of said at least two work items in said subset of work items differs from the relative order of said at least two work items in said particular sequentially ordered set of work items.
 44. The non-transitory computer-readable storage medium of claim 32, wherein the steps of reading, producing and applying are performed as part of a database recovery process performed in response to a failure of one or more database management servers.
 45. The non-transitory computer-readable storage medium of claim 29, wherein the steps of reading, producing and applying are performed as part of a database recovery process performed in response to corruption or loss of persistently stored data managed by one or more database management servers.
 46. The non-transitory computer-readable storage medium of claim 29, wherein the steps of reading, producing and applying are performed as part of a process of updating a copy of a database with changes made at a database from which the copy was derived.
 47. The non-transitory computer-readable storage medium of claim 29, wherein the steps of reading, producing and applying are performed as part of a process of updating a copy of a file system with changes made at the database from which the copy was derived.
 48. The non-transitory computer-readable storage medium of claim 29, wherein the steps of reading, producing and applying are performed as part of a process of updating a copy of a file system with changes made at the file system from which the copy was derived.
 49. The non-transitory computer-readable storage medium of claim 29, wherein a worker process of said plurality of worker processes produces said one or more sequentially ordered sets of work items based, at least in part, on work items that were read from said log, by worker processes from said plurality of worker processes other than said worker process.
 50. The non-transitory computer-readable storage medium of claim 29 wherein the log is one of a plurality of logs from which said plurality of worker processes read said work items.
 51. A method for processing sequences of work items from a log, wherein each work item from said log corresponds to a particular data object of a plurality of data objects, wherein the method comprises computer-implemented steps of: each worker process, of a plurality of worker processes, producing a respective sequentially ordered set of work items belonging to a plurality of sequentially ordered sets of work items, wherein said respective sequentially ordered set of work items corresponds to a respective data object of said plurality of data objects, wherein said each worker process producing a respective sequentially ordered set of work items comprises said each worker process: reading, directly from said log, work items, wherein only a portion of work items directly read from said log by said each worker process corresponds to the respective data object assigned to said each worker process; ordering said work items that correspond to the respective data object to form said respective sequentially ordered set of work items; wherein no work item in any other respective sequentially ordered set of work items produced by any worker process of said plurality of worker processes corresponds to said respective data object; and wherein the sequentially ordered set of work items is different than any other sequentially ordered sets of work items of said plurality of sequentially ordered sets of work items; wherein each data object of said plurality of data objects corresponds to at least one work item of said plurality of sequentially ordered sets of work items; and wherein the method is performed by one or more computer devices.
 52. The method of claim 51, wherein for said each worker process of said plurality of worker processes the step of reading is performed by said each worker process without partitioning of the work items by a coordinator process prior to the step of reading.
 53. The method of claim 52, wherein the step of producing said sequentially ordered set of work items is performed by said each worker process without receiving the work items that correspond to the respective data object from the coordinator process.
 54. The method of claim 51, wherein a set of worker processes includes said plurality of worker processes, wherein said log is a global log buffer, and wherein the steps further include a first worker process of said set of worker processes adding one or more sequences of work items to said global log buffer.
 55. The method of claim 54, wherein said plurality of worker processes includes said first worker process.
 56. The method of claim 54, wherein said plurality of worker processes does not include said first worker process.
 57. The method of claim 51, wherein the steps further include a set of worker processes applying said plurality of sequentially ordered sets of work items to said plurality of data objects.
 58. The method of claim 57, wherein a particular worker process belongs to said plurality of worker processes.
 59. The method of claim 57, wherein a particular worker process does not belong to said plurality of worker processes.
 60. The method of claim 57, further comprising the computer-implemented steps of: by each of the set of worker processes, periodically providing to a coordinator process, an identifier of a most recent work item, from the log, that the worker process has applied; and by the coordinator process, persistently storing a global checkpoint that identifies a particular location in the work items in the log, that all of the set of worker processes have reached in the step of applying work items to corresponding data objects, and periodically providing, to each of the set of worker processes, the global checkpoint.
 61. The method of claim 60, further comprising the computer-implemented steps of: by the coordinator process, persistently storing the identifier of the most recent work item that each worker process of the set of worker processes has applied.
 62. The method of claim 57, wherein applying comprises applying, by a first worker process of the set of worker processes, a first sequentially ordered set of work items of said plurality of sequentially ordered set of work items to a first data object and a second sequentially ordered set of work items of said plurality of sequentially ordered set of work items to a second data object, the method further comprising: by the first worker process, caching the first sequentially ordered set of work items into a cache accessible to the first worker process; while waiting for the first data object to be loaded from persistent storage into volatile memory accessible to the first worker process, applying the second sequentially ordered set of work items to the second data object; and once the first data object is loaded into the volatile memory accessible to the first worker process, then reading the first sequentially ordered set of work items from the cache, and applying the first sequentially ordered set of work items to the first data object in the volatile memory.
 63. The method of claim 57, wherein the steps of producing and applying are performed as part of a database recovery process performed in response to a failure of one or more database management servers.
 64. The method of claim 57, wherein the steps of producing and applying are performed as part of a database recovery process performed in response to corruption or loss of persistently stored data managed by one or more database management servers.
 65. The method of claim 57, wherein the steps of producing and applying are performed as part of a process of updating a copy of a database with changes made at a database from which the copy was derived.
 66. The method of claim 57, wherein the steps of producing and applying are performed as part of a process of updating a copy of a file system with changes made at a file system from which the copy was derived.
 67. The method of claim 57, wherein the steps of producing and applying are performed as part of a process of updating a copy of data stored on one or more storage units with changes made at one or more storage units from which the copy was derived.
 68. The method of claim 51, wherein each of the plurality of worker processes is associated with one of a plurality of servers that are communicatively interconnected, and wherein the log comprises work items associated with at least two of the plurality of servers.
 69. The method of claim 51, wherein the step of reading work items includes: reading, by a first set of worker processes of said plurality of worker processes, from said log; and reading, by a second set of worker processes that is different than the first set of said plurality of worker processes, from said log.
 70. The method of claim 51, wherein the method for processing sequences of work items from a log is performed by a group of worker processes, and wherein the plurality of worker processes is a subset of the group of worker processes.
 71. The method of claim 51, wherein the log is one of a plurality of logs from which said plurality of worker processes read said work items.
 72. The method of claim 51, wherein at least two work items correspond to a particular data object of said plurality of data objects, wherein a particular sequentially ordered set of work items of said plurality of sequentially ordered sets of work items corresponds to said particular data object and contains said at least two work items, and wherein a relative order of said at least two work items in said log differs from the relative order of said at least two work items in said particular sequentially ordered set of work items.
 73. The method of claim 51, wherein the log is stored in a shared memory accessible to the plurality of worker processes.
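By way of illustration only, the reading and ordering steps recited in claims 51 and 72 might be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the `WorkItem` fields, the `owner_of` assignment rule (a simple byte-sum modulo the worker count), and the function name `produce_ordered_sets` are all hypothetical and do not appear in the claims.

```python
from collections import defaultdict
from typing import NamedTuple


class WorkItem(NamedTuple):
    seq: int      # hypothetical sequence number giving the required apply order
    obj_id: str   # identifier of the data object the work item corresponds to
    op: str       # opaque description of the operation


def owner_of(obj_id: str, n_workers: int) -> int:
    """Hypothetical assignment rule: map each data object to exactly one worker."""
    return sum(obj_id.encode()) % n_workers


def produce_ordered_sets(log, worker_id, n_workers):
    """Read every item directly from the log, keep only the portion that
    corresponds to this worker's data objects, and order it per object."""
    per_object = defaultdict(list)
    for item in log:
        if owner_of(item.obj_id, n_workers) == worker_id:
            per_object[item.obj_id].append(item)
    # The order in the log may differ from the apply order, so sort by seq.
    return {obj: sorted(items, key=lambda w: w.seq)
            for obj, items in per_object.items()}
```

Because every worker evaluates the same assignment rule on its own, no coordinator has to partition the log beforehand, and no two workers produce an ordered set for the same data object.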
 74. A non-transitory computer-readable storage medium storing one or more sequences of instructions for processing sequences of work items from a log, wherein each work item from said log corresponds to a particular data object of a plurality of data objects, which sequences of instructions, when executed by one or more processors, cause performance of steps comprising: each worker process, of a plurality of worker processes, producing a respective sequentially ordered set of work items belonging to a plurality of sequentially ordered sets of work items, wherein said respective sequentially ordered set of work items corresponds to a respective data object of said plurality of data objects, wherein said each worker process producing a respective sequentially ordered set of work items comprises said each worker process: reading, directly from said log, work items, wherein only a portion of work items directly read from said log by said each worker process corresponds to the respective data object assigned to said each worker process; ordering said work items that correspond to the respective data object to form said respective sequentially ordered set of work items; wherein no work item in any other respective sequentially ordered set of work items produced by any worker process of said plurality of worker processes corresponds to said respective data object; wherein the sequentially ordered set of work items is different than any other sequentially ordered set of work items of said plurality of sequentially ordered sets of work items; and wherein each data object of said plurality of data objects corresponds to at least one work item of said plurality of sequentially ordered sets of work items.
 75. The non-transitory computer-readable storage medium of claim 74, wherein for said each worker process of said plurality of worker processes the step of reading is performed by said each worker process without partitioning of the work items by a coordinator process prior to the step of reading.
 76. The non-transitory computer-readable storage medium of claim 75, wherein the step of producing said sequentially ordered set of work items is performed by said each worker process without receiving the work items that correspond to the respective data object from the coordinator process.
 77. The non-transitory computer-readable storage medium of claim 74, wherein a set of worker processes includes said plurality of worker processes, wherein said log is a global log buffer, and wherein the steps further include a first worker process of said set of worker processes adding one or more sequences of work items to said global log buffer.
 78. The non-transitory computer-readable storage medium of claim 77, wherein said plurality of worker processes includes said first worker process.
 79. The non-transitory computer-readable storage medium of claim 77, wherein said plurality of worker processes does not include said first worker process.
 80. The non-transitory computer-readable storage medium of claim 74, wherein the steps further include a set of worker processes applying said plurality of sequentially ordered sets of work items to said plurality of data objects.
 81. The non-transitory computer-readable storage medium of claim 80, wherein a particular worker process belongs to said plurality of worker processes.
 82. The non-transitory computer-readable storage medium of claim 80, wherein a particular worker process does not belong to said plurality of worker processes.
 83. The non-transitory computer-readable storage medium of claim 80, further comprising additional instructions that cause: by each of the set of worker processes, periodically providing to a coordinator process, an identifier of a most recent work item, from the log, that the worker process has applied; by the coordinator process, persistently storing a global checkpoint that identifies a particular location in the work items in the log, that all of the set of worker processes have reached in the step of applying work items to corresponding data objects, and periodically providing, to each of the set of worker processes, the global checkpoint.
 84. The non-transitory computer-readable storage medium of claim 83, further comprising additional instructions that cause: by the coordinator process, persistently storing the identifier of the most recent work item that each worker process of the set of worker processes has applied.
 85. The non-transitory computer-readable storage medium of claim 80, wherein applying comprises applying, by a first worker process of the set of worker processes, a first sequentially ordered set of work items of said plurality of sequentially ordered set of work items to a first data object and a second sequentially ordered set of work items of said plurality of sequentially ordered set of work items to a second data object, the non-transitory computer-readable storage medium further comprising additional instructions that cause: by the first worker process, caching the first sequentially ordered set of work items into a cache accessible to the first worker process; while waiting for the first data object to be loaded from persistent storage into volatile memory accessible to the first worker process, applying the second sequentially ordered set of work items to the second data object; and once the first data object is loaded into the volatile memory accessible to the first worker process, then reading the first sequentially ordered set of work items from the cache, and applying the first sequentially ordered set of work items to the first data object in the volatile memory.
 86. The non-transitory computer-readable storage medium of claim 80, wherein the steps of producing and applying are performed as part of a database recovery process performed in response to a failure of one or more database management servers.
 87. The non-transitory computer-readable storage medium of claim 80, wherein the steps of producing and applying are performed as part of a database recovery process performed in response to corruption or loss of persistently stored data managed by one or more database management servers.
 88. The non-transitory computer-readable storage medium of claim 80, wherein the steps of producing and applying are performed as part of a process of updating a copy of a database with changes made at the database from which the copy was derived.
 89. The non-transitory computer-readable storage medium of claim 80, wherein the steps of producing and applying are performed as part of a process of updating a copy of a file system with changes made at a file system from which the copy was derived.
 90. The non-transitory computer-readable storage medium of claim 80, wherein the steps of producing and applying are performed as part of a process of updating a copy of data stored on one or more storage units with changes made at one or more storage units from which the copy was derived.
 91. The non-transitory computer-readable storage medium of claim 80, wherein the log is one of a plurality of logs from which said plurality of worker processes read said work items.
 92. The non-transitory computer-readable storage medium of claim 74, wherein each of the plurality of worker processes is associated with one of a plurality of servers that are communicatively interconnected, and wherein the log comprises work items associated with at least two of the plurality of servers.
 93. The non-transitory computer-readable storage medium of claim 74, wherein the step of reading work items includes: reading, by a first set of worker processes of said plurality of worker processes, from said log; and reading, by a second set of worker processes that is different than the first set of said plurality of worker processes, from said log.
 94. The non-transitory computer-readable storage medium of claim 74, wherein the non-transitory computer-readable storage medium for processing sequences of work items from a log is performed by a group of worker processes, and wherein the plurality of worker processes is a subset of the group of worker processes.
 95. The non-transitory computer-readable storage medium of claim 74, wherein at least two work items correspond to a particular data object of said plurality of data objects, wherein a particular sequentially ordered set of work items of said plurality of sequentially ordered sets of work items corresponds to said particular data object and contains said at least two work items, and wherein a relative order of said at least two work items in said log differs from the relative order of said at least two work items in said particular sequentially ordered set of work items.
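For illustration, the checkpointing recited in claims 60, 61, 83 and 84 might be sketched as below. This is a hedged sketch, not the claimed implementation. It assumes that work items carry increasing integer identifiers starting at 1, that each worker applies its items in identifier order, and that the location all workers have reached can be taken as the minimum of the reported identifiers. The class name `Coordinator` and its methods are hypothetical, and persistent storage of the checkpoint is omitted (the values are held in memory only).

```python
class Coordinator:
    """Tracks per-worker progress and derives a global checkpoint
    without requiring the workers to stop at a synchronization point."""

    def __init__(self, worker_ids):
        # 0 means "nothing applied yet" (assumes identifiers start at 1).
        self.progress = {w: 0 for w in worker_ids}
        self.checkpoint = 0

    def report(self, worker_id, last_applied_id):
        """Called periodically by a worker with the identifier of the
        most recent work item it has applied."""
        self.progress[worker_id] = max(self.progress[worker_id], last_applied_id)

    def compute_checkpoint(self):
        """Assumed rule: the checkpoint is the smallest reported identifier,
        that is, a point every worker has reached. It never moves backward."""
        self.checkpoint = max(self.checkpoint, min(self.progress.values()))
        return self.checkpoint
```

In this sketch a worker that has not yet reported holds the checkpoint at 0, which reflects the conservative choice of never advancing past a point that some worker may not have reached.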